Pattern Matching and Pattern Discovery Algorithms for Protein Topologies

Authors

  • Juris Viksna
  • David R. Gilbert
Abstract

We describe algorithms for pattern matching and pattern learning in TOPS diagrams (formal descriptions of protein topologies). These problems can be reduced to checking for subgraph isomorphism and finding maximal common subgraphs in a restricted class of ordered graphs. We have developed a subgraph isomorphism algorithm for ordered graphs, which performs well on the given set of data. The maximal common subgraph problem is then solved by repeated subgraph extension and checking for isomorphisms. Despite its apparent inefficiency, this approach gives an algorithm with time complexity proportional to the number of graphs in the input set, and it remains practical on the given set of data. As a result we obtain fast methods which can be used for building a database of protein topological motifs, and for comparing a given protein of known secondary structure against a motif database.

Supported by a Wellcome Trust International Research Award.

1 Biological motivation

Once the structure of a protein has been determined, the next task for a biologist is to form hypotheses about its function. One possible approach is pairwise comparison of the structure with the structures of proteins whose functions are already known. Several tools already allow such comparisons, for example DALI [7] (http://www.ebi.ac.uk/dali/) or CATH [11] (http://www.biochem.ucl.ac.uk/bsm/cath/). However, this approach has two weaknesses. Firstly, as the number of proteins with known structure grows, so does the time needed to perform such comparisons. Currently there are about 15000 protein structure descriptions deposited in the Protein Data Bank [1] (http://www.rcsb.org/pdb/), but in the future this number may grow significantly. Secondly, even if a similarity with one or more proteins has been found, it may not be apparent whether this also implies functional similarity, especially if the similarity is not very strong.
Another possibility is to use, at the structure level, an approach similar to that used for sequences in the PROSITE database [6] (http://ca.expasy.org/prosite/). That is, to pre-compute a database of motifs for proteins with known structures, i.e. structural patterns which are associated with some particular protein function. This effectively requires computing the maximal common substructure for a set of structures. One such approach is that of CORA [10], based on multiple structural alignments of protein sequences for given CATH families.

Both of these approaches have been successfully used for protein comparison on the sequence level. The main difficulty in adapting them to the structural level is the complexity of the necessary algorithms: whilst exact sequence comparison algorithms work in linear time, exact structure comparison algorithms may require exponential time, and the situation only gets worse with algorithms for finding maximal common substructures. Another aspect of the problem is that it is far from clear what the best way to define structure similarity is. There are many possible approaches, which require different algorithmic methods and are likely to produce different results.

Our work is aimed at the development of efficient comparison and maximal common substructure algorithms using TOPS diagrams for structural topology descriptions, at defining structure similarity in a natural way that arises from such a formalisation, and at the evaluation of the usefulness of such an approach. The drawback of this approach is that TOPS diagrams are not very rich in information; its advantage is that it is still possible to define practical algorithms at this level of abstraction.
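The reduction mentioned in the abstract, from pattern matching to subgraph isomorphism on ordered graphs, can be illustrated with a small backtracking search. This is only a sketch under simplifying assumptions and not the authors' algorithm: a diagram is modelled as a string of vertex labels in sequence order plus a dictionary of labelled edges, and the label alphabets (`E`, `H` for vertices, `P`, `A` for edges) as well as the names `OrderedGraph` and `find_embedding` are hypothetical choices made for the example.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int]


class OrderedGraph:
    """Vertices 0..n-1 in a fixed order, with labelled vertices and edges.

    Assumption for this sketch: no self-loops, at most one edge per pair.
    """

    def __init__(self, labels: str, edges: Dict[Edge, str]):
        self.labels = labels
        # Store every edge with its smaller endpoint first.
        self.edges = {(min(a, b), max(a, b)): lab for (a, b), lab in edges.items()}


def find_embedding(pattern: OrderedGraph, target: OrderedGraph) -> Optional[List[int]]:
    """Return an order-preserving map from pattern vertices to target vertices
    that keeps vertex labels and maps every pattern edge to a target edge with
    the same label, or None if no such map exists."""
    n, m = len(pattern.labels), len(target.labels)
    mapping: List[int] = []  # mapping[i] = target vertex chosen for pattern vertex i

    def consistent(i: int, t: int) -> bool:
        if pattern.labels[i] != target.labels[t]:
            return False
        # Check pattern edges that end at i; their other endpoint is already mapped.
        for (a, b), lab in pattern.edges.items():
            if b == i and target.edges.get((mapping[a], t)) != lab:
                return False
        return True

    def extend(i: int, start: int) -> bool:
        if i == n:
            return True
        # Leave enough target vertices for the remaining pattern vertices.
        for t in range(start, m - (n - i) + 1):
            if consistent(i, t):
                mapping.append(t)
                if extend(i + 1, t + 1):
                    return True
                mapping.pop()
        return False

    return list(mapping) if extend(0, 0) else None


# Hypothetical toy data: 'E' = strand, 'H' = helix, 'P'/'A' = two edge kinds.
target = OrderedGraph("EEHEE", {(0, 1): "A", (1, 3): "P", (3, 4): "A"})
pattern = OrderedGraph("EEE", {(0, 1): "A", (1, 2): "P"})
print(find_embedding(pattern, target))
```

Because the mapping must preserve vertex order, each pattern vertex only needs to be tried against target vertices to the right of the previous choice, which is what keeps the search small compared with general subgraph isomorphism. A pattern discovery procedure in the spirit of the abstract would call such a check repeatedly while extending a candidate pattern.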

Related resources

Local Derivative Pattern with Smart Thresholding: Local Composition Derivative Pattern for Palmprint Matching

Palmprint recognition is a new biometrics system based on physiological characteristics of the palmprint, which includes rich, stable, and unique features such as lines, points, and texture. Texture is one of the most important features extracted from low resolution images. In this paper, a new local descriptor, Local Composition Derivative Pattern (LCDP) is proposed to extract smartly stronger...

Pattern discovery methods for protein topology diagrams

We are carrying out research into developing several approaches to pattern discovery in protein topology diagrams, and comparing them. The underlying motivation is to efficiently automatically generate patterns classifying sets of proteins and to apply this to characterising databases of protein structure. We are using TOPS protein topology diagrams, which we have formalised as a restricted kind ...

Algorithms for discovering repeated patterns in multidimensional representations of polyphonic music

In this paper we give an overview of four algorithms that we have developed for pattern matching, pattern discovery and data compression in multidimensional datasets. We show that these algorithms can fruitfully be used for processing musical data. In particular, we show that our algorithms can discover instances of perceptually significant musical repetition that cannot be found using previous...

High Performance Pattern Matching on Heterogeneous Platform

Pattern discovery is one of the fundamental tasks in bioinformatics and pattern recognition is a powerful technique for searching sequence patterns in the biological sequence databases. Fast and high performance algorithms are highly demanded in many applications in bioinformatics and computational molecular biology since the significant increase in the number of DNA and protein sequences expan...

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

The classificatory power of a pattern is measured by how well it separates two given sets of strings. This paper gives practical algorithms to find the fixed/variable-length-don’t-care pattern (FVLDC pattern) and approximate FVLDC pattern which are most classificatory for two given string sets. We also present algorithms to discover the best window-accumulated FVLDC pattern and window-accumulat...

Publication date: 2001